Machine Learning (Spring 2022)

Hands-On 1

Amirali Soltani Tehrani

Python

Python is an interpreted, high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small- and large-scale projects.

Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.

Guido van Rossum began working on Python in the late 1980s, as a successor to the ABC programming language, and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020.

Text editor/IDE options¶

  • PyCharm (IDE)
  • Visual Studio Code (code editor/IDE)
  • Sublime Text (text editor)
  • Atom (text editor)
  • Notepad++/gedit (text editors)
  • Vim (terminal-based editor)

Basic Python

String Manipulation¶

Formatting¶

In [38]:
print('I love Machine Learning course.'.upper()+' (upper)')
print('I love Machine Learning course.'.rjust(40)+ ' (rjust 40)')
print('i love Machine Learning course.'.capitalize()+ ' (capitalize)')
print('       I love Machine Learning course.       '.strip()+ ' (strip)')
I LOVE MACHINE LEARNING COURSE. (upper)
         I love Machine Learning course. (rjust 40)
I love machine learning course. (capitalize)
I love Machine Learning course. (strip)

Concatenation¶

In [45]:
ml_sem_code = 2022

print('I like ' + str(ml_sem_code) + ' a lot!')
print(f'{print} (print a function)')
print(f'{type(229)} (print a type)')
I like 2022 a lot!
<built-in function print> (print a function)
<class 'int'> (print a type)

Formatting¶

In [3]:
txt = "For only {price:.2f} dollars!"
print(txt.format(price = 49)) 
For only 49.00 dollars!
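The same format specifications also work inside f-strings (Python 3.6+), which are often more readable than str.format; a quick sketch:

```python
price = 49
print(f"For only {price:.2f} dollars!")      # same .2f spec as in str.format
print(f"For only {price:>10.2f} dollars!")   # right-aligned in a 10-character field
```

The first line prints `For only 49.00 dollars!`, just like the format call above.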

Lists¶

List Creation¶

In [13]:
list_1 = ['one', 'two', 'three']

Insertion/extension¶

In [14]:
list_1.append(4)
list_1.insert(0, 'ZERO')

print(list_1)
['ZERO', 'one', 'two', 'three', 4]
In [15]:
list_2 = [1, 2, 3]
list_1.extend(list_2)

print(list_1)
['ZERO', 'one', 'two', 'three', 4, 1, 2, 3]
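Note the difference between append and extend: append adds its argument as a single element, while extend adds each element of an iterable individually. A small sketch with throwaway lists:

```python
a = [1, 2]
a.append([3, 4])   # the whole list becomes one nested element
b = [1, 2]
b.extend([3, 4])   # each element is added individually
print(a)   # [1, 2, [3, 4]]
print(b)   # [1, 2, 3, 4]
```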

List comprehension¶

In [17]:
long_list = [i for i in range(9)]
long_long_list = [(i, j) for i in range(3)
                  for j in range(5)]
long_list_list = [[i for i in range(3)]
                  for _ in range(5)]

print(long_list)
print(long_long_list)
print(long_list_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8]
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]
[[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]

Sorting¶

In [19]:
random_list_2 = [(3, 'z'), (12, 'r'), (6, 'e'),
                 (8, 'c'), (2, 'g')]
# sorted() returns a new list; it does not modify the original
sorted_list = sorted(random_list_2, key=lambda x: x[1])

print(sorted_list)
[(8, 'c'), (6, 'e'), (2, 'g'), (12, 'r'), (3, 'z')]

Dictionary and Set¶

Set (unordered, unique)¶

In [20]:
my_set = {i ** 2 for i in range(10)}

print(my_set)
{0, 1, 64, 4, 36, 9, 16, 49, 81, 25}

Dictionary (mapping)¶

In [21]:
my_dict = {(5-i): i ** 2 for i in range(10)}

print(my_dict)
{5: 0, 4: 1, 3: 4, 2: 9, 1: 16, 0: 25, -1: 36, -2: 49, -3: 64, -4: 81}

Dictionary update¶

In [22]:
second_dict = {'a': 10, 'b': 11}
my_dict.update(second_dict)

print(my_dict)
{5: 0, 4: 1, 3: 4, 2: 9, 1: 16, 0: 25, -1: 36, -2: 49, -3: 64, -4: 81, 'a': 10, 'b': 11}

Iterate through items¶

In [23]:
for k, it in my_dict.items():
    print(k, it)
5 0
4 1
3 4
2 9
1 16
0 25
-1 36
-2 49
-3 64
-4 81
a 10
b 11
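When a key may be missing, dict.get avoids a KeyError by returning a default instead; a small sketch with a throwaway dictionary:

```python
d = {'a': 10, 'b': 11}
print(d.get('a', 0))   # 10 — the key exists
print(d.get('z', 0))   # 0  — a missing key falls back to the default
print('a' in d)        # True — membership tests check keys, not values
```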

Numpy

  • Package for scientific computing in Python
  • Vector and matrix manipulation
  • Broadcasting and vectorization (matrix operations): saves time & cleans up code

Some Useful Numpy Functions¶

Python Command   Description
np.linalg.inv    Inverse of a matrix
np.linalg.eig    Eigenvalues & eigenvectors of an array
np.matmul        Matrix multiplication
np.zeros/ones    Create a matrix filled with zeros/ones
np.arange        Range from start to stop with a step size (see also np.linspace)
np.identity      Create an identity matrix
np.vstack        Vertically stack two arrays (see also np.hstack)
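A quick sketch exercising a few of these functions (the values are chosen so the results are easy to check by hand):

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 4.0]])
A_inv = np.linalg.inv(A)                       # inverse of A
print(np.matmul(A, A_inv))                     # A @ A_inv recovers the identity
eig_vals, eig_vecs = np.linalg.eig(A)          # eigenvalues & eigenvectors
print(eig_vals)                                # eigenvalues of this diagonal matrix: 2 and 4
print(np.arange(0, 10, 2))                     # start=0, stop=10, step=2
print(np.vstack([np.zeros(2), np.ones(2)]))    # stack two rows vertically
```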

Some Debugging Tools¶

Python Command                Description
array.shape                   Get the shape of a NumPy array
array.dtype                   Check the data type of an array (for precision issues or unexpected behavior)
type(stuff)                   Get the type of a variable
import pdb; pdb.set_trace()   Set a breakpoint
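Inspecting shape and dtype is usually the first debugging step; a minimal sketch:

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float32)
print(arr.shape)   # (3, 4)
print(arr.dtype)   # float32
print(type(arr))   # <class 'numpy.ndarray'>
```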

Basic Numpy Usage¶

Initialization from Python lists¶

In [59]:
import numpy as np

array_1d = np.array([1, 2, 3, 4])
print(array_1d)
array_1by4 = np.array([[1, 2, 3, 4]])
print(array_1by4)
large_array = np.array([i for i in range(36)])
print(large_array)
large_array = large_array.reshape((6, 6))
print(large_array)
[1 2 3 4]
[[1 2 3 4]]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35]
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]

Lists with different types¶

In [28]:
from_list = np.array([1, 2, 3])
from_list_2d = np.array([[1, 2, 3.0], [4, 5, 6]])
from_list_bad_type = np.array([1, 2, 3, 'a'])
print(f'Data type of integer is {from_list.dtype}')
print(f'Data type of float is {from_list_2d.dtype}')
print(f'Data type of mixed is {from_list_bad_type.dtype}')
Data type of integer is int64
Data type of float is float64
Data type of mixed is <U21

NumPy algebra functions¶

In [30]:
array_1 = np.array([1, 2, 3])
array_1 + 5
array_1 * 5
np.sqrt(array_1)
np.power(array_1, 2)
np.exp(array_1)
np.log(array_1)
Out[30]:
array([0.        , 0.69314718, 1.09861229])

Dot product and matrix multiplication¶

Ways to write dot product¶

In [32]:
array_1 = np.array([1, 2, 3])
array_2 = array_1

array_1 @ array_2
array_1.dot(array_2)
np.dot(array_1, array_2)
Out[32]:
14

Matrix multiplication like $Ax$¶

In [33]:
weight_matrix = np.array([1, 2, 3, 4]).reshape(2, 2)
sample = np.array([[50, 60]]).T
np.matmul(weight_matrix, sample)
Out[33]:
array([[170],
       [390]])

2D matrix multiplication¶

In [34]:
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
np.matmul(mat1, mat2)
Out[34]:
array([[19, 22],
       [43, 50]])

Element-wise multiplication¶

In [35]:
a = np.array([i for i in range(10)]).reshape(2, 5)
a * a
np.multiply(a, a)
np.multiply(a, 10)
Out[35]:
array([[ 0, 10, 20, 30, 40],
       [50, 60, 70, 80, 90]])

Broadcasting¶

NumPy compares the dimensions of the operands, aligning shapes from the trailing (rightmost) axes; a missing dimension or a dimension of size 1 is stretched to match the other operand so the operation is still valid. Be careful with dimensions!

In [36]:
op1 = np.array([i for i in range(9)]).reshape(3, 3)
op2 = np.array([[1, 2, 3]])
op3 = np.array([1, 2, 3])
In [39]:
# Results are different here
print(op1 + op2)
print(op1 + op2.T)
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
[[ 1  2  3]
 [ 5  6  7]
 [ 9 10 11]]
In [40]:
# Results are same here
print(op1 + op3)
print(op1 + op3.T)
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
[[ 1  3  5]
 [ 4  6  8]
 [ 7  9 11]]
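The results match because transposing a 1-D array is a no-op: op3 has shape (3,), and .T leaves it unchanged. Only a 2-D row vector such as op2 changes shape under .T. A quick check:

```python
import numpy as np

op2 = np.array([[1, 2, 3]])   # 2-D row vector, shape (1, 3)
op3 = np.array([1, 2, 3])     # 1-D array, shape (3,)
print(op3.shape, op3.T.shape)   # (3,) (3,) — .T does nothing to a 1-D array
print(op2.shape, op2.T.shape)   # (1, 3) (3, 1)
```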

Broadcasting for pairwise distance¶

In [47]:
samples = np.random.random((3, 5))

# Without broadcasting
expanded1 = np.expand_dims(samples, axis=1)
tile1 = np.tile(expanded1, (1, samples.shape[0], 1))
expanded2 = np.expand_dims(samples, axis=0)
tile2 = np.tile(expanded2, (samples.shape[0], 1, 1))
diff = tile2 - tile1
distances = np.linalg.norm(diff, axis=-1)
print(distances)

# With broadcasting
diff = samples[:, np.newaxis, :] - samples[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=-1)
print(distances)

# Also could use scipy
import scipy.spatial
distances = scipy.spatial.distance.cdist(samples, samples)
print(distances)
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]
[[0.         0.497356   1.07883004]
 [0.497356   0.         1.087224  ]
 [1.07883004 1.087224   0.        ]]

Why use vectorization?¶

Shorter code and faster execution! Compare the following examples.

In [61]:
import time

a = np.random.random(500000)
b = np.random.random(500000)

# Using NumPy dot product
t = time.time()
dot = a.dot(b)
print("Execution time with numpy: " + str(time.time() - t))
print(dot)

# Using for loops
dot = 0.0
t = time.time()
for i in range(len(a)):
    dot += a[i] * b[i]
print("Execution time with for loop: " + str(time.time() - t))
print(dot)
Execution time with numpy: 0.006165742874145508
124656.76023488915
Execution time with for loop: 0.2621474266052246
124656.7602348879

The speed-up depends on the setup and the nature of the computation!

In [63]:
samples = np.random.random((100, 5))

# Using NumPy with broadcasting
t = time.time()
diff = samples[: ,np.newaxis, :] - samples[np.newaxis, :, :]
distances = np.linalg.norm(diff, axis=-1)
avg_dist = np.mean(distances)
print("Execution time with numpy: " + str(time.time() - t))
print(avg_dist)

# Using for loops
t = time.time()
total_dist = []
for s1 in samples:
    for s2 in samples:
        d = np.linalg.norm(s1 - s2)
        total_dist.append(d)
avg_dist = np.mean(total_dist)
print("Execution time with for loop: " + str(time.time() - t))
print(avg_dist)
Execution time with numpy: 0.001271963119506836
0.8475971610622879
Execution time with for loop: 0.06235051155090332
0.8475971610622879

Tools for Plotting

Matplotlib/Seaborn¶

Mostly used for visualization (line, scatter, and bar plots, images, and even interactive 3D).

Example plots¶

In [65]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

# Plotting
fig, ax = plt.subplots()
ax.plot(t, s)

# Format plotting
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

# Save/show
fig.savefig("test.png")
plt.show()

Plot with dash lines and legend¶

In [66]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 500)
y = np.sin(x)
fig, ax = plt.subplots()
line1, = ax.plot(x, y, label='Using set_dashes()')

# 2pt line, 2pt break, 10pt line, 2pt break
line1.set_dashes([2, 2, 10, 2])
line2, = ax.plot(x, y - 0.2, dashes=[6, 2],
                 label='Using the dashes parameter')
ax.legend()
plt.show()

Using subplot¶

In [67]:
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a grid with 2 rows and 1 column,
# then plot on the 1st subplot
plt.subplot(2, 1, 1)
plt.grid()
plt.plot(x, y_sin)
plt.title('Sine Wave')

# Now plot on the 2nd subplot
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine Wave')
plt.grid()
plt.tight_layout()

Confusion Matrix Plot with Matplotlib¶

Commonly used when evaluating classifiers.

In [72]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train)

np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(
        classifier,
        X_test,
        y_test,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
    )
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()
Confusion matrix, without normalization
[[13  0  0]
 [ 0 10  6]
 [ 0  0  9]]
Normalized confusion matrix
[[1.   0.   0.  ]
 [0.   0.62 0.38]
 [0.   0.   1.  ]]

Pandas¶

Mostly used for

  • DataFrame (database/Excel-like)
  • Easy filtering and aggregation (also plotting, though with fewer features than dedicated data-visualization packages)

Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in Pandas are implemented with Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
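A minimal sketch of both structures (toy data and hypothetical column names, just for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.5, 4.0], index=['a', 'b', 'c'])   # 1-D labeled array of one dtype
print(s['b'])   # 2.5 — access by label

df_small = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [90, 85]})   # dict of Series
print(df_small.shape)             # (2, 2) — rows are instances, columns are features
print(df_small['score'].mean())   # 87.5
```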

In [73]:
import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)
In [83]:
df = pd.read_csv("churn-bigml-80.csv")
df.head()
Out[83]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
0 KS 128 415 No Yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False
1 OH 107 415 No Yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False
2 NJ 137 415 No No 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False
3 OH 84 408 Yes No 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False
4 OK 75 415 Yes No 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False

Let’s have a look at data dimensionality, feature names, and feature types.

In [84]:
print(df.shape)
(3333, 20)

From the output, we can see that the table contains 3333 rows and 20 columns.

In [85]:
print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')

We can use the info() method to output some general information about the dataframe:

In [86]:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   int64  
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   float64
 15  Total intl minutes      3333 non-null   float64
 16  Total intl calls        3333 non-null   int64  
 17  Total intl charge       3333 non-null   float64
 18  Customer service calls  3333 non-null   int64  
 19  Churn                   3333 non-null   bool   
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None

bool, int64, float64 and object are the data types of our features. We see that one feature is logical (bool), 3 features are of type object, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none because each column contains 3333 observations, the same number of rows we saw before with shape.

We can change the column type with the astype method. Let’s apply this method to the Churn feature to convert it into int64:

In [87]:
df["Churn"] = df["Churn"].astype("int64")

The describe method shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, range, median, 0.25 and 0.75 quartiles.

In [88]:
df.describe()
Out[88]:
Account length Area code Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
count 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00 3333.00
mean 101.06 437.18 8.10 179.78 100.44 30.56 200.98 100.11 17.08 200.87 100.11 9.04 10.24 4.48 2.76 1.56 0.14
std 39.82 42.37 13.69 54.47 20.07 9.26 50.71 19.92 4.31 50.57 19.57 2.28 2.79 2.46 0.75 1.32 0.35
min 1.00 408.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 23.20 33.00 1.04 0.00 0.00 0.00 0.00 0.00
25% 74.00 408.00 0.00 143.70 87.00 24.43 166.60 87.00 14.16 167.00 87.00 7.52 8.50 3.00 2.30 1.00 0.00
50% 101.00 415.00 0.00 179.40 101.00 30.50 201.40 100.00 17.12 201.20 100.00 9.05 10.30 4.00 2.78 1.00 0.00
75% 127.00 510.00 20.00 216.40 114.00 36.79 235.30 114.00 20.00 235.30 113.00 10.59 12.10 6.00 3.27 2.00 0.00
max 243.00 510.00 51.00 350.80 165.00 59.64 363.70 170.00 30.91 395.00 175.00 17.77 20.00 20.00 5.40 9.00 1.00

For categorical (type object) and boolean (type bool) features we can use the value_counts method. Let’s take a look at the distribution of Churn:

In [94]:
df["Churn"].value_counts()
Out[94]:
0    2850
1     483
Name: Churn, dtype: int64
In [96]:
df["Churn"].value_counts(normalize=True)
Out[96]:
0    0.86
1    0.14
Name: Churn, dtype: float64

Sorting¶

A DataFrame can be sorted by the value of one of the variables (i.e., columns). For example, we can sort by Total day charge (use ascending=False to sort in descending order):

In [98]:
df.sort_values(by="Total day charge", ascending=False).head()
Out[98]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
296 CO 154 415 No No 0 350.8 75 59.64 216.5 94 18.40 253.9 100 11.43 10.1 9 2.73 1 1
780 NY 64 415 Yes No 0 346.8 55 58.96 249.5 79 21.21 275.4 102 12.39 13.3 9 3.59 1 1
2087 OH 115 510 Yes No 0 345.3 81 58.70 203.4 106 17.29 217.5 107 9.79 11.8 8 3.19 1 1
128 OH 83 415 No No 0 337.4 120 57.36 227.4 116 19.33 153.9 114 6.93 15.8 7 4.27 0 1
485 MO 112 415 No No 0 335.5 77 57.04 212.5 109 18.06 265.0 132 11.93 12.7 8 3.43 2 1
In [99]:
df.sort_values(by=["Churn", "Total day charge"], ascending=[True, False]).head()
Out[99]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
2812 MN 13 510 No Yes 21 315.6 105 53.65 208.9 71 17.76 260.1 123 11.70 12.1 3 3.27 3 0
1818 NC 210 415 No Yes 31 313.8 87 53.35 147.7 103 12.55 192.7 97 8.67 10.1 7 2.73 3 0
2770 LA 67 510 No No 0 310.4 97 52.77 66.5 123 5.65 246.5 99 11.09 9.2 10 2.48 4 0
460 SD 114 415 No Yes 36 309.9 90 52.68 200.3 89 17.03 183.5 105 8.26 14.2 2 3.83 1 0
2302 AL 141 510 No Yes 28 308.0 123 52.36 247.8 128 21.06 152.9 103 6.88 7.4 3 2.00 1 0

Indexing and retrieving data¶

A DataFrame can be indexed in a few different ways.

To get a single column, you can use a DataFrame['Name'] construction. Let’s use this to answer a question about that column alone: what is the proportion of churned users in our dataframe?

In [101]:
df["Churn"].mean()
Out[101]:
0.14491449144914492

Boolean indexing with one column is also very convenient. The syntax is df[P(df['Name'])], where P is some logical condition that is checked for each element of the Name column. The result of such indexing is the DataFrame consisting only of rows that satisfy the P condition on the Name column.

Let’s use it to answer the question:

What are average values of numerical features for churned users?

In [102]:
df[df["Churn"] == 1].mean(numeric_only=True)
Out[102]:
Account length            102.66
Area code                 437.82
Number vmail messages       5.12
Total day minutes         206.91
Total day calls           101.34
Total day charge           35.18
Total eve minutes         212.41
Total eve calls           100.56
Total eve charge           18.05
Total night minutes       205.23
Total night calls         100.40
Total night charge          9.24
Total intl minutes         10.70
Total intl calls            4.16
Total intl charge           2.89
Customer service calls      2.23
Churn                       1.00
dtype: float64

How much time (on average) do churned users spend on the phone during daytime?

In [103]:
df[df["Churn"] == 1]["Total day minutes"].mean()
Out[103]:
206.91407867494823

What is the maximum length of international calls among loyal users (Churn == 0) who do not have an international plan?

In [104]:
df[(df["Churn"] == 0) & (df["International plan"] == "No")]["Total intl minutes"].max()
Out[104]:
18.9
In [106]:
df.loc[0:5, "State":"Area code"]
Out[106]:
State Account length Area code
0 KS 128 415
1 OH 107 415
2 NJ 137 415
3 OH 84 408
4 OK 75 415
5 AL 118 510
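Note that loc slices by label and includes both endpoints (which is why 0:5 above returned six rows), whereas iloc slices by integer position and excludes the stop, like ordinary Python slicing. A small sketch on a toy frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30]}, index=['r0', 'r1', 'r2'])
print(df_demo.loc['r0':'r1', 'x'].tolist())   # [10, 20] — label slice, both ends included
print(df_demo.iloc[0:1]['x'].tolist())        # [10] — positional slice, stop excluded
```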

Applying Functions to Cells, Columns and Rows¶

In [107]:
df.apply(np.max)
Out[107]:
State                        WY
Account length              243
Area code                   510
International plan          Yes
Voice mail plan             Yes
Number vmail messages        51
Total day minutes         350.8
Total day calls             165
Total day charge          59.64
Total eve minutes         363.7
Total eve calls             170
Total eve charge          30.91
Total night minutes       395.0
Total night calls           175
Total night charge        17.77
Total intl minutes         20.0
Total intl calls             20
Total intl charge           5.4
Customer service calls        9
Churn                         1
dtype: object

The apply method can also be used to apply a function to each row. To do this, specify axis=1. Lambda functions are very convenient in such scenarios. For example, if we need to select all states starting with ‘W’, we can do it like this:

In [108]:
df[df["State"].apply(lambda state: state[0] == "W")].head()
Out[108]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
8 WV 141 415 Yes Yes 37 258.6 84 43.96 222.0 111 18.87 326.4 97 14.69 11.2 5 3.02 0 0
22 WY 57 408 No Yes 39 213.0 115 36.21 191.1 112 16.24 182.7 115 8.22 9.5 3 2.57 0 0
38 WI 64 510 No No 0 154.0 67 26.18 225.8 118 19.19 265.3 86 11.94 3.5 3 0.95 1 0
41 WY 97 415 No Yes 24 133.2 135 22.64 217.2 58 18.46 70.6 79 3.18 11.0 3 2.97 1 0
45 WY 87 415 No No 0 151.0 83 25.67 219.7 116 18.67 203.9 127 9.18 9.7 3 2.62 5 1

The map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:

In [109]:
d = {"No": False, "Yes": True}
df["International plan"] = df["International plan"].map(d)
df.head()
Out[109]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
0 KS 128 415 False Yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 0
1 OH 107 415 False Yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 0
2 NJ 137 415 False No 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 0
3 OH 84 408 True No 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 0
4 OK 75 415 True No 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 0
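One behavioral difference worth knowing: map leaves values missing from the dictionary as NaN, while replace keeps them unchanged. A toy sketch:

```python
import pandas as pd

s = pd.Series(['No', 'Yes', 'Maybe'])
mapped = s.map({'No': False, 'Yes': True})        # 'Maybe' becomes NaN
replaced = s.replace({'No': False, 'Yes': True})  # 'Maybe' is kept as-is
print(mapped.tolist())
print(replaced.tolist())
```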

Grouping¶

In general, grouping data in Pandas works as follows:

In [ ]:
df.groupby(by=grouping_columns)[columns_to_show].function()
  1. First, the groupby method splits the data according to the values of grouping_columns. Those values become a new index in the resulting dataframe.

  2. Then, the columns of interest are selected (columns_to_show). If columns_to_show is omitted, all columns except the grouping columns are included.

  3. Finally, one or several functions are applied to the obtained groups, per selected column.

In [110]:
columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]

df.groupby(["Churn"])[columns_to_show].describe(percentiles=[])
Out[110]:
Total day minutes Total eve minutes Total night minutes
count mean std min 50% max count mean std min 50% max count mean std min 50% max
Churn
0 2850.0 175.18 50.18 0.0 177.2 315.6 2850.0 199.04 50.29 0.0 199.6 361.8 2850.0 200.13 51.11 23.2 200.25 395.0
1 483.0 206.91 69.00 0.0 217.6 350.8 483.0 212.41 51.73 70.9 211.3 363.7 483.0 205.23 47.13 47.4 204.80 354.9
In [111]:
columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]

df.groupby(["Churn"])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
Out[111]:
Total day minutes Total eve minutes Total night minutes
mean std amin amax mean std amin amax mean std amin amax
Churn
0 175.18 50.18 0.0 315.6 199.04 50.29 0.0 361.8 200.13 51.11 23.2 395.0
1 206.91 69.00 0.0 350.8 212.41 51.73 70.9 363.7 205.23 47.13 47.4 354.9

DataFrame transformations¶

Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.

For example, if we want to calculate the total number of calls for all users, let’s create the total_calls Series and paste it into the DataFrame:

In [112]:
total_calls = (
    df["Total day calls"]
    + df["Total eve calls"]
    + df["Total night calls"]
    + df["Total intl calls"]
)
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)
# loc is the position (column index) at which to insert the Series object;
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()
Out[112]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes ... Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn Total calls
0 KS 128 415 False Yes 25 265.1 110 45.07 197.4 ... 16.78 244.7 91 11.01 10.0 3 2.70 1 0 303
1 OH 107 415 False Yes 26 161.6 123 27.47 195.5 ... 16.62 254.4 103 11.45 13.7 3 3.70 1 0 332
2 NJ 137 415 False No 0 243.4 114 41.38 121.2 ... 10.30 162.6 104 7.32 12.2 5 3.29 0 0 333
3 OH 84 408 True No 0 299.4 71 50.90 61.9 ... 5.26 196.9 89 8.86 6.6 7 1.78 2 0 255
4 OK 75 415 True No 0 166.7 113 28.34 148.3 ... 12.61 186.9 121 8.41 10.1 3 2.73 3 0 359

5 rows × 21 columns

In [113]:
df["Total charge"] = (
    df["Total day charge"]
    + df["Total eve charge"]
    + df["Total night charge"]
    + df["Total intl charge"]
)
df.head()
Out[113]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes ... Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn Total calls Total charge
0 KS 128 415 False Yes 25 265.1 110 45.07 197.4 ... 244.7 91 11.01 10.0 3 2.70 1 0 303 75.56
1 OH 107 415 False Yes 26 161.6 123 27.47 195.5 ... 254.4 103 11.45 13.7 3 3.70 1 0 332 59.24
2 NJ 137 415 False No 0 243.4 114 41.38 121.2 ... 162.6 104 7.32 12.2 5 3.29 0 0 333 62.29
3 OH 84 408 True No 0 299.4 71 50.90 61.9 ... 196.9 89 8.86 6.6 7 1.78 2 0 255 66.80
4 OK 75 415 True No 0 166.7 113 28.34 148.3 ... 186.9 121 8.41 10.1 3 2.73 3 0 359 52.09

5 rows × 22 columns

To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 to delete columns; 0, the default, to delete rows). The inplace argument tells whether to change the original DataFrame. With inplace=False, the drop method doesn't change the existing DataFrame and returns a new one with the rows or columns dropped. With inplace=True, it alters the DataFrame in place.

In [114]:
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()
Out[114]:
State Account length Area code International plan Voice mail plan Number vmail messages Total day minutes Total day calls Total day charge Total eve minutes Total eve calls Total eve charge Total night minutes Total night calls Total night charge Total intl minutes Total intl calls Total intl charge Customer service calls Churn
0 KS 128 415 False Yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 0
3 OH 84 408 True No 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 0
4 OK 75 415 True No 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 0
5 AL 118 510 True No 0 223.4 98 37.98 220.6 101 18.75 203.9 118 9.18 6.3 6 1.70 0 0
6 MA 121 510 False Yes 24 218.2 88 37.09 348.5 108 29.62 212.6 118 9.57 7.5 7 2.03 3 0

Going through Telecom Dataset¶

In [115]:
pd.crosstab(df["Churn"], df["International plan"], margins=True)
Out[115]:
International plan False True All
Churn
0 2664 186 2850
1 346 137 483
All 3010 323 3333
In [118]:
import matplotlib.pyplot as plt
import seaborn as sns

# import some nice vis settings
sns.set()
# Graphics in the Retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'

sns.countplot(x="International plan", hue="Churn", data=df);

We see that the churn rate is much higher with the International Plan, which is an interesting observation! Perhaps large, poorly controlled expenses on international calls cause conflicts and dissatisfaction among the telecom operator's customers.

Next, let’s look at another important feature – Customer service calls. Let’s also make a summary table and a picture.

In [119]:
pd.crosstab(df["Churn"], df["Customer service calls"], margins=True)
Out[119]:
Customer service calls 0 1 2 3 4 5 6 7 8 9 All
Churn
0 605 1059 672 385 90 26 8 4 1 0 2850
1 92 122 87 44 76 40 14 5 1 2 483
All 697 1181 759 429 166 66 22 9 2 2 3333
In [120]:
sns.countplot(x="Customer service calls", hue="Churn", data=df);

Although it’s not so obvious from the summary table, it’s easy to see from the plot above that the churn rate increases sharply at 4 or more customer service calls.

Now let’s add a binary feature to our DataFrame – Customer service calls > 3. And once again, let’s see how it relates to churn.

In [121]:
df["Many_service_calls"] = (df["Customer service calls"] > 3).astype("int")

pd.crosstab(df["Many_service_calls"], df["Churn"], margins=True)
Out[121]:
Churn 0 1 All
Many_service_calls
0 2721 345 3066
1 129 138 267
All 2850 483 3333
In [122]:
sns.countplot(x="Many_service_calls", hue="Churn", data=df);

Data Visualization

Univariate visualization¶

Univariate analysis looks at one feature at a time. When we analyze a feature independently, we are usually mostly interested in the distribution of its values and ignore other features in the dataset.

Below, we will consider different statistical types of features and the corresponding tools for their individual visual analysis.

Quantitative features¶

Quantitative features take on ordered numerical values. Those values can be discrete, like integers, or continuous, like real numbers, and usually express a count or a measurement.

Histograms and density plots¶

The easiest way to take a look at the distribution of a numerical variable is to plot its histogram using the DataFrame’s method hist().

In [123]:
features = ["Total day minutes", "Total intl calls"]
df[features].hist(figsize=(10, 4));

A histogram groups values into bins of equal width. The shape of the histogram may contain clues about the underlying distribution type: Gaussian, exponential, etc. You can also spot skewness and anomalies in an otherwise regular shape. Knowing the distribution of the feature values becomes important when you use Machine Learning methods that assume a particular type (most often Gaussian).
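The equal-width binning that hist() performs can be sketched with numpy.histogram (a synthetic sample here, not the course data):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=180, scale=50, size=1000)  # stand-in for a minutes-like feature

# np.histogram splits the value range into bins of equal width
counts, edges = np.histogram(sample, bins=10)

widths = np.diff(edges)  # every bin covers the same value range
print(counts, widths[0])
```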

In the above plot, we see that the variable Total day minutes is normally distributed, while Total intl calls is prominently skewed right (its tail is longer on the right).

There is also another, often clearer, way to grasp the distribution: density plots or, more formally, Kernel Density Plots. They can be considered a smoothed version of the histogram. Their main advantage over the latter is that they do not depend on the size of the bins. Let’s create density plots for the same two variables:

In [124]:
df[features].plot(
    kind="density", subplots=True, layout=(1, 2), sharex=False, figsize=(10, 4)
);

It is also possible to plot a distribution of observations with seaborn’s histplot() (the older distplot() is deprecated in recent seaborn versions). For example, let’s look at the distribution of Total intl calls. With kde=True, the plot displays the histogram with the kernel density estimate (KDE) on top.

In [129]:
sns.histplot(df["Total intl calls"], kde=True);

Boxplot¶

Another useful type of visualization is a box plot. seaborn does a great job here:

In [130]:
sns.boxplot(x="Total intl calls", data=df);

Let’s see how to interpret a box plot. Its components are a box (obviously, this is why it is called a box plot), the so-called whiskers, and a number of individual points (outliers).

The box itself illustrates the interquartile spread of the distribution; its length is determined by the 25th (Q1) and 75th (Q3) percentiles. The vertical line inside the box marks the median (50th percentile) of the distribution.

The whiskers are the lines extending from the box. They represent the spread of the data points, specifically the points that fall within the interval (Q1 − 1.5⋅IQR, Q3 + 1.5⋅IQR), where IQR = Q3 − Q1 is the interquartile range.

Outliers that fall outside of the range bounded by the whiskers are plotted individually as black points along the central axis.

We can see that a large number of international calls is quite rare in our data.
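The whisker bounds and outliers described above can be computed by hand. A sketch on synthetic call counts (a Poisson-like stand-in, not the actual column):

```python
import numpy as np

rng = np.random.default_rng(1)
calls = rng.poisson(lam=4.5, size=500)  # synthetic stand-in for "Total intl calls"

q1, q3 = np.percentile(calls, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are the ones a box plot draws individually
outliers = calls[(calls < lo) | (calls > hi)]
print(f"whiskers: [{lo:.1f}, {hi:.1f}], outliers: {outliers.size}")
```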

Violin Plot¶

The last type of distribution plots that we will consider is a violin plot.

Look at the figures below. On the left, we see the already familiar box plot. To the right, there is a violin plot with the kernel density estimate on both sides.

In [131]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4))
sns.boxplot(data=df["Total intl calls"], ax=axes[0])
sns.violinplot(data=df["Total intl calls"], ax=axes[1]);

The difference between the box and violin plots is that the former illustrates certain statistics concerning individual examples in a dataset while the violin plot concentrates more on the smoothed distribution as a whole.

In our case, the violin plot does not contribute any additional information about the data as everything is clear from the box plot alone.

Categorical and binary features¶

Categorical features take on a fixed number of values. Each of these values assigns an observation to a corresponding group, known as a category, which reflects some qualitative property of this example. Binary variables are an important special case of categorical variables when the number of possible values is exactly 2. If the values of a categorical variable are ordered, it is called ordinal.
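In pandas, an ordinal variable can be represented with an ordered CategoricalDtype, which makes the values comparable along the declared order. A small sketch (the category names are illustrative):

```python
import pandas as pd

# An ordered categorical: "low" < "medium" < "high"
dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
risk = pd.Series(["low", "high", "medium", "low"], dtype=dtype)

print(risk.min(), risk.max())
print((risk > "low").tolist())
```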

Frequency table¶

Let’s check the class balance in our dataset by looking at the distribution of the target variable: the churn rate. First, we will get a frequency table, which shows how frequent each value of the categorical variable is. For this, we will use the value_counts() method:

In [132]:
df["Churn"].value_counts()
Out[132]:
0    2850
1     483
Name: Churn, dtype: int64

In our case, the data is not balanced; that is, our two target classes, loyal and disloyal customers, are not represented equally in the dataset. Only a small part of the clients canceled their subscription to the telecom service. As we will see in the following articles, this fact may imply some restrictions on measuring the classification performance, and, in the future, we may want to additionally penalize our model errors in predicting the minority “Churn” class.
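The class shares can be read off directly by passing normalize=True to value_counts(). A sketch reproducing the same class counts as above:

```python
import pandas as pd

# Series with the same class counts as the churn target (2850 loyal, 483 churned)
churn = pd.Series([0] * 2850 + [1] * 483)

shares = churn.value_counts(normalize=True)
print(shares)
```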

Bar plot¶

The bar plot is a graphical representation of the frequency table. The easiest way to create it is to use seaborn’s function countplot(). There is another function in seaborn that is somewhat confusingly called barplot() and is mostly used to represent basic statistics of a numerical variable grouped by a categorical feature.
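The distinction can be made concrete without plotting: countplot() draws row counts per category, while barplot() by default draws the mean of a numeric column per category. A toy sketch of the two underlying aggregations (column names are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "plan": ["Yes", "Yes", "No", "No"],
    "minutes": [200.0, 220.0, 150.0, 170.0],
})

# What countplot() visualizes: the frequency of each category
counts = toy["plan"].value_counts()

# What barplot() visualizes by default: the mean per category
means = toy.groupby("plan")["minutes"].mean()

print(counts, means, sep="\n")
```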

Let’s plot the distributions for two categorical variables:

In [133]:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

sns.countplot(x="Churn", data=df, ax=axes[0])
sns.countplot(x="Customer service calls", data=df, ax=axes[1]);

Multivariate visualization¶

Multivariate plots allow us to see relationships between two and more different variables, all in one figure. Just as in the case of univariate plots, the specific type of visualization will depend on the types of the variables being analyzed.

Quantitative vs. Quantitative¶

Correlation matrix¶

Let’s look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.

First, we will use the method corr() on a DataFrame that calculates the correlation between each pair of features. Then, we pass the resulting correlation matrix to heatmap() from seaborn, which renders a color-coded matrix for the provided values:

In [134]:
# Drop non-numerical variables
numerical = list(
    set(df.columns)
    - set(
        [
            "State",
            "International plan",
            "Voice mail plan",
            "Area code",
            "Churn",
            "Customer service calls",
        ]
    )
)

# Calculate and plot
corr_matrix = df[numerical].corr()
sns.heatmap(corr_matrix);

From the colored correlation matrix generated above, we can see that there are 4 variables, such as Total day charge, that have been calculated directly from the number of minutes spent on phone calls (Total day minutes). These dependent variables contribute no additional information and can therefore be left out. Let’s get rid of them:

In [135]:
numerical = list(
    set(numerical)
    - set(
        [
            "Total day charge",
            "Total eve charge",
            "Total night charge",
            "Total intl charge",
        ]
    )
)
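Dropping such dependent variables can also be automated by thresholding the correlation matrix. A sketch on a synthetic frame where, as in the dataset, charge is a linear function of minutes (the 0.17 rate is made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
minutes = rng.uniform(0, 350, size=200)
toy = pd.DataFrame({
    "Total day minutes": minutes,
    "Total day charge": minutes * 0.17,  # perfectly dependent on minutes
    "Total intl calls": rng.poisson(4.5, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```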

Scatter plot¶

The scatter plot displays values of two numerical variables as Cartesian coordinates in 2D space. Scatter plots in 3D are also possible.

Let’s try out the function scatter() from the matplotlib library:

In [136]:
plt.scatter(df["Total day minutes"], df["Total night minutes"]);

We get an uninteresting picture of two normally distributed variables. Also, it seems that these features are uncorrelated because the ellipse-like shape is aligned with the axes.

There is a slightly fancier option to create a scatter plot with the seaborn library:

In [137]:
sns.jointplot(x="Total day minutes", y="Total night minutes", data=df, kind="scatter");

The function jointplot() also draws the marginal histograms of the two variables, which may be useful in some cases.

Using the same function, we can also get a smoothed version of our bivariate distribution:

In [138]:
sns.jointplot(
    x="Total day minutes", y="Total night minutes", data=df, kind="kde", color="g"
);

Scatterplot matrix¶

In some cases, we may want to plot a scatterplot matrix such as the one shown below. Its diagonal contains the distributions of the corresponding variables, and the scatter plots for each pair of variables fill the rest of the matrix.

In [139]:
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(df[numerical]);

Quantitative vs. Categorical¶

In this section, we will make our simple quantitative plots a little more exciting. We will try to gain new insights for churn prediction from the interactions between the numerical and categorical features.

More specifically, let’s see how the input variables are related to the target variable Churn.

Previously, you learned about scatter plots. Additionally, their points can be color- or size-coded so that the values of a third categorical variable are also presented in the same figure. We can achieve this with the scatter() function seen above, but let’s try a new function, lmplot(), and use the parameter hue to indicate our categorical feature of interest:

In [140]:
sns.lmplot(
    x="Total day minutes", y="Total night minutes", data=df, hue="Churn", fit_reg=False
);

Categorical vs. Categorical¶

As we saw earlier in this article, the variable Customer service calls has few unique values and, thus, can be considered either numerical or ordinal. We have already seen its distribution with a count plot. Now, we are interested in the relationship between this ordinal feature and the target variable Churn.

Let’s look at the distribution of the number of calls to customer service, again using a count plot. This time, let’s also pass the parameter hue=Churn that adds a categorical dimension to the plot:

In [141]:
sns.countplot(x="Customer service calls", hue="Churn", data=df);

An observation: the churn rate increases significantly after 4 or more calls to customer service.

Now, let’s look at the relationship between Churn and the binary features, International plan and Voice mail plan.

In [142]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

sns.countplot(x="International plan", hue="Churn", data=df, ax=axes[0])
sns.countplot(x="Voice mail plan", hue="Churn", data=df, ax=axes[1]);

Contingency table¶

In addition to using graphical means for categorical analysis, there is a traditional tool from statistics: a contingency table, also called a cross tabulation. It shows a multivariate frequency distribution of categorical variables in tabular form. In particular, it allows us to see the distribution of one variable conditional on the other by looking along a column or row.
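crosstab can also normalize the counts, which directly yields the conditional distributions described above. A sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": [0, 0, 0, 0, 0, 1, 1, 1],
    "Plan":  ["No", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# normalize="index" turns each row of counts into shares that sum to 1
table = pd.crosstab(toy["Churn"], toy["Plan"], normalize="index")
print(table)
```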

Let’s try to see how Churn is related to the categorical variable State by creating a cross tabulation:

In [143]:
pd.crosstab(df["State"], df["Churn"]).T
Out[143]:
State AK AL AR AZ CA CO CT DC DE FL ... SD TN TX UT VA VT WA WI WV WY
Churn
0 49 72 44 60 25 57 62 49 52 55 ... 52 48 54 62 72 65 52 71 96 68
1 3 8 11 4 9 9 12 5 9 8 ... 8 5 18 10 5 8 14 7 10 9

2 rows × 51 columns

More Details on Plotting¶

In [5]:
# Increase the default plot size and set the color scheme

plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["image.cmap"] = "viridis"

df = pd.read_csv("Video_Games_Sales").dropna()
print(df.shape)
(6825, 16)

1. DataFrame.plot()¶

In [157]:
df[[x for x in df.columns if "Sales" in x] + ["Year_of_Release"]].groupby(
    "Year_of_Release"
).sum().plot();

2. Seaborn¶

Now, let’s move on to the Seaborn library. seaborn is essentially a higher-level API built on top of the matplotlib library. Among other things, it differs from the latter in that it contains more sensible default settings for plotting. By adding import seaborn as sns; sns.set() to your code, your plots will become much nicer. The library also contains a set of complex visualization tools that would otherwise (i.e. when using bare matplotlib) require quite a large amount of code.

pairplot()¶

Let’s take a look at the first of such complex plots, a pairwise relationships plot, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output.

In [158]:
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(
    df[["Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count"]]
);

As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.

histplot()¶

It is also possible to plot a distribution of observations with seaborn’s histplot() (the deprecated distplot() served the same purpose in older versions). For example, let’s look at the distribution of critics’ ratings: Critic_Score. With kde=True, the plot displays a histogram with the kernel density estimate on top.

In [159]:
sns.histplot(df["Critic_Score"], kde=True);

jointplot()¶

To look more closely at the relationship between two numerical variables, you can use a joint plot, which is a cross between a scatter plot and a histogram. Let’s see how the Critic_Score and User_Score features are related.

In [160]:
sns.jointplot(x="Critic_Score", y="User_Score", data=df, kind="scatter");

boxplot()¶

Another useful type of plot is a box plot. Let’s compare critics’ ratings for the top 5 biggest gaming platforms.

In [161]:
top_platforms = (
    df["Platform"].value_counts().sort_values(ascending=False).head(5).index.values
)
sns.boxplot(
    y="Platform",
    x="Critic_Score",
    data=df[df["Platform"].isin(top_platforms)],
    orient="h",
);

heatmap()¶

The last type of plot that we will cover here is a heat map. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform.

In [162]:
platform_genre_sales = (
    df.pivot_table(
        index="Platform", columns="Genre", values="Global_Sales", aggfunc=sum
    )
    .fillna(0)
    .applymap(float)
)
sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=0.5);

3. Plotly¶

We have examined some visualization tools based on the matplotlib library. However, this is not the only option for plotting in Python. Let’s take a look at the plotly library. Plotly is an open-source library that allows the creation of interactive plots within a Jupyter notebook without having to write any JavaScript.

The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in on a specific part of the plot, etc.

Before we start, let’s import all the necessary modules and initialize plotly by calling the init_notebook_mode() function.

In [6]:
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot

init_notebook_mode(connected=True)

Line Plot¶

In [7]:
years_df = (
    df.groupby("Year_of_Release")[["Global_Sales"]]
    .sum()
    .join(df.groupby("Year_of_Release")[["Name"]].count())
)
years_df.columns = ["Global_Sales", "Number_of_Games"]

Figure is the main class and the workhorse of visualization in plotly. It consists of the data (an array of lines, called traces in this library) and the style (represented by the layout object). In the simplest case, you may pass only the traces to the iplot function.

The show_link parameter toggles the visibility of the links leading to the online platform plot.ly in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing show_link=False to prevent accidental clicks on those links.

In [8]:
# Create a line (trace) for the global sales
trace0 = go.Scatter(x=years_df.index, y=years_df["Global_Sales"], name="Global Sales")

# Create a line (trace) for the number of games released
trace1 = go.Scatter(
    x=years_df.index, y=years_df["Number_of_Games"], name="Number of games released"
)

# Define the data array
data = [trace0, trace1]

# Set the title
layout = {"title": "Statistics for video games"}

# Create a Figure and plot it
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)

As an option, you can save the plot in an html file:

In [ ]:
plotly.offline.plot(fig, filename="years_stats.html", show_link=False, auto_open=False);

Bar chart¶

Let’s use a bar chart to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue.

In [9]:
# Do calculations and prepare the dataset
platforms_df = (
    df.groupby("Platform")[["Global_Sales"]]
    .sum()
    .join(df.groupby("Platform")[["Name"]].count())
)
platforms_df.columns = ["Global_Sales", "Number_of_Games"]
platforms_df.sort_values("Global_Sales", ascending=False, inplace=True)
In [10]:
# Create a bar for the global sales
trace0 = go.Bar(
    x=platforms_df.index, y=platforms_df["Global_Sales"], name="Global Sales"
)

# Create a bar for the number of games released
trace1 = go.Bar(
    x=platforms_df.index,
    y=platforms_df["Number_of_Games"],
    name="Number of games released",
)

# Get together the data and style objects
data = [trace0, trace1]
layout = {"title": "Market share by gaming platform"}

# Create a `Figure` and plot it
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)

Box plot¶

plotly also supports box plots. Let’s consider the distribution of critics’ ratings by the genre of the game.

In [11]:
data = []

# Create a box trace for each genre in our dataset
for genre in df.Genre.unique():
    data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))

# Visualize
iplot(data, show_link=False)

References¶

  • Stanford Introduction to Python Course

  • Intro to Data Visualization mlcourse.ai

More Useful Resources¶

  • “Plotly for interactive plots” - a tutorial by Alexander Kovalev within mlcourse.ai
  • “Bring your plots to life with Matplotlib animations” - a tutorial by Kyriacos Kyriacou within mlcourse.ai
  • “Some details on Matplotlib” - a tutorial by Ivan Pisarev within mlcourse.ai